Human Motion Tracking Using 3D Image Features with a Long Short-Term Memory Mechanism Model—An Example of Forward Reaching

Human motion tracking is widely applied to rehabilitation tasks, and inertial measurement unit (IMU) sensors are a well-known approach for recording motion behavior. IMU sensors can provide accurate information regarding three-dimensional (3D) human motion. However, IMU sensors must be attached to the body, which can be inconvenient or uncomfortable for users. To alleviate this issue, a visual-based tracking system from two-dimensional (2D) RGB images has been studied extensively in recent years and proven to have a suitable performance for human motion tracking. However, the 2D image system has its limitations. Specifically, human motion consists of spatial changes, and the 3D motion features predicted from the 2D images have limitations. In this study, we propose a deep learning (DL) human motion tracking technology using 3D image features with a deep bidirectional long short-term memory (DBLSTM) mechanism model. The experimental results show that, compared with the traditional 2D image system, the proposed system provides improved human motion tracking ability with RMSE in acceleration less than 0.5 (m/s2) X, Y, and Z directions. These findings suggest that the proposed model is a viable approach for future human motion tracking applications.


Introduction
Rehabilitation is becoming an increasingly important issue owing to the rise in elderly population. According to the World Population Ageing 2020 report [1], 727 million people were aged 65 years or older around the world, and the number of the elderly was projected to rise to 1.5 billion by 2050. In addition, the World Health Organization (WHO) indicates that 15 million people suffer from stroke each year [2], with 75% of those being elderly [3]. This means that healthcare, such as motion rehabilitation to regain motor function, must be valued and improved. infrared light on the object, whereas the receiver obtains the reflected light and calculates the depth information through the time difference. In addition, lower cost and higher portability make it popular for research and even for daily use. Based on the advantages of the RGB-D sensors, Chen et al. used CNN technology to perform regression analysis of the IMU information, demonstrating that RGB-D sensor technologies not only improve the performance of human motion tracking but also provide more useful information of 3D features than of 2D features [23]. Moreover, Ma et al. used long short-term memory (LSTM) [24], a deep learning (DL) method proficient in temporal information processing, to rectify the kinematic information obtained from the ToF camera [25]. Specifically, the x-, y-, and z-axis acceleration information originating from human motion were captured and used as the regression goal. The results indicated that it is feasible to use the DL method to study the accuracy of human motion tracking systems. At the same time, using an LSTM-type model has advantages over temporal information processing. Based on the successful results noted above, the purpose of this study is to prove whether a deep bidirectional LSTM (DBLSTM) mechanism model [26] with 3D image features can further improve the performance of human motion tracking. Conventional models, including a deep neural network (DNN) [27] and CNN, were compared to investigate the benefits of temporal information in human motion tracking tasks. Note that the "arm reaching" task, an important training task in rehabilitation, was chosen as the test example in this study, as it is a common daily function for the upper extremities. In addition, the features of classic RGB images, known as 2D images, were used as the baseline system for comparison to investigate the benefits of the proposed system compared to 3D images.
Our main contributions of this study are as follows: (1) We investigate whether the time-related deep learning model structure, DBLSTM, is advantageous compared to several traditional models in human motion tracking tasks. Employing IMU sensors as our target provided x-, y-, and z-axis acceleration information, which are a kind of continuous signal and closely related to time clues. We took our system as a kind of regression approach to obtain the acceleration information from 3D image features, and the results demonstrated that DBLSTM indeed performed well in such kinds of tasks. (2) We delve into the potentiality of 3D image features to determine whether the data comprise more information that can be representative of human motion. The 3D image comprises features such as the relative relationship-i.e., involving vectors and Euclidean distances-between the human joints. Two-dimensional image features were used as a comparison. Data from several recording views were collected and evaluated through our proposed system, and the results did verify the high potentiality for 3D images.

The Proposed System
A block diagram of the proposed system, including the training and testing phases, is shown in Figure 1. In the training phase, the amplitude (M A ) and point cloud information (M P ) were recorded using a ToF camera. Specifically, M A refers to the energy of the images, which are 2D grayscale images; M P refers to the point cloud data, which are generated through the information of M A and phases that are obtained through the receiver on the ToF camera. Representations of M A and M P are shown in Figure 2. Next, M A was input to the OpenPose system to obtain the human joints (J 2D ), which were represented by twodimensional coordinates. Images were input into the OpenPose system, and human poses were estimated using the CNN. Confidence maps and part affinity fields were set up to detect human body parts and part associations from different people. Part affinity was first calculated through CNN to analyze the vector information of human poses. For example, based on the colors of the images, the vectors offered information linking the shoulder to the elbow, the elbow to the wrist, and so on. Subsequently, the obtained information was combined with the original input, that is, means the image, and sent to the next convolutional layers to analyze the positions of the human joints [10]. All of the joints from the bottom to the top were obtained through OpenPose. Based on our previous study [23], four human joints were selected for the "arm reaching" action: the right shoulder (RS), left sent to the next convolutional layers to analyze the positions of the human joints [10]. All of the joints from the bottom to the top were obtained through OpenPose. Based on our previous study [23], four human joints were selected for the "arm reaching" action: the right shoulder (RS), left shoulder (LS), left elbow (LE), and left wrist (LW). The LS point was chosen as the reference point. Note that these four points were the most efficient positions in the observation of "arm reaching" actions in our previous study.
where ( , ), ( , ), ( , ), and ( , ) are the joint coordinates of the right shoulder, left shoulder, left elbow, and left wrist, respectively; represents the Euclidean distances from the right shoulder, left elbow, and left wrist joints to the left shoulder joint. More specifically, can be expressed as: A 3D generator unit was used to combine the information of and [23]; assists the generator to determine the depth coordinates of the images from the distribution of the point cloud data, which is denoted by . The same feature selection as for the 2D images was adapted, with the joint of the left shoulder used as the reference point to  sent to the next convolutional layers to analyze the positions of the human joints [10]. All of the joints from the bottom to the top were obtained through OpenPose. Based on our previous study [23], four human joints were selected for the "arm reaching" action: the right shoulder (RS), left shoulder (LS), left elbow (LE), and left wrist (LW). The LS point was chosen as the reference point. Note that these four points were the most efficient positions in the observation of "arm reaching" actions in our previous study.
where ( , ), ( , ), ( , ), and ( , ) are the joint coordinates of the right shoulder, left shoulder, left elbow, and left wrist, respectively; represents the Euclidean distances from the right shoulder, left elbow, and left wrist joints to the left shoulder joint. More specifically, can be expressed as: A 3D generator unit was used to combine the information of and [23]; assists the generator to determine the depth coordinates of the images from the distribution of the point cloud data, which is denoted by . The same feature selection as for the 2D images was adapted, with the joint of the left shoulder used as the reference point to The features for the human joints in the 2D images (J 2D ) can be expressed as: where RS 2D (x, y), LS 2D (x, y), LE 2D (x, y), and LW 2D (x, y) are the joint coordinates of the right shoulder, left shoulder, left elbow, and left wrist, respectively; D 2D represents the Euclidean distances from the right shoulder, left elbow, and left wrist joints to the left shoulder joint. More specifically, D 2D can be expressed as: A 3D generator unit was used to combine the information of J 2D and M P [23]; J 2D assists the generator to determine the depth coordinates of the images from the distribution of the point cloud data, which is denoted by J 3D . The same feature selection as for the 2D images was adapted, with the joint of the left shoulder used as the reference point to obtain relative vectors and distances. In this case, the features for the human joints in the 3D images (J 3D ) can be expressed as: where RS 3D (x, y, z), LS 3D (x, y, z), LE 3D (x, y, z), and LW 3D (x, y, z) represent the joint coordinates for the right shoulder, left shoulder, left elbow, and left wrist, respectively; D 3D represents the Euclidean distances from the right shoulder, left elbow, and left wrist to the left shoulder. More specifically, D 3D can be expressed as: In the training phase, the J 2D and J 3D data were separately used as inputs for the 2D and 3D regression model tasks, respectively. The same target (G) was used for both to evaluate performance. The obtained J 3D was used as the input for the proposed DBLSTM model, which has the advantage of processing temporal information such as natural language and speech recognition tasks [26,28]. Therefore, it was assumed that the DBLSTM model would provide further benefits for tracking tasks because human motion is a type of spatial movement and continuous signal. The synchronously recorded data G from the IMU sensor was used as the output of the DBLSTM model. The G signals included the x-, y-, and z-axis acceleration information, which can be expressed as: where Acc x , Acc y , and Acc z are the accelerations of the x-, y-, and z-axis, respectively, obtained from the IMU sensors. The mean squared error was used as the loss function to ensure convergence of the model and evaluate the predicted results. The related equation can be expressed as: where t indicates the total frame of the training data; G i andĜ i are the training target and predicted result in the ith time frame, respectively; W * and B * represent the optimal weight and bias parameters of the training model, respectively. Finally, the ToF camera was used to record the human motion and subsequently obtain the predicted motion information in the testing phase.

Materials
The data from the IMU sensors and the ToF camera were recorded at 50 fps and were temporally aligned. "Arm reaching" [29] was used as the target task because it is the most commonly executed motor function by the upper extremities and a major focus of stroke rehabilitation. The setup of the recording environment is shown in Figure 3. During the experiment, the IMU sensors were placed on the subject's left wrist, and the ToF camera was placed in front of the subject. The subject was seated behind a cup with their arm resting on the table. During the repetition of the task, the subject was asked to perform the arm reaching task repetitively. The whole sequences of "arm reaching" included: (1) the subject extending the left arm and using the left hand to pick up and raise the cup to the height that matched their mouth, and (2) the subject returning the cup to the starting point. As the experiment started, the subject would first raise the hand on the table for approximately 5 s; then, the "arm reaching" motion started. Until the time was up, the subject had to lower their hand. An Xsens IMU sensor [30] and RL100 Pro camera were used [31]. The IMU sensor was placed flat on the table before the experiment started to ensure that the calibrated x-, y-, and z-axis directions were the same in every recording. Considering the inherent deficit of cumulative error from the accelerometers, they were calibrated before the experiments were held. The data had been analyzed to compensate for accelerator errors, resulting in the minimization of cumulative errors. Data were recorded for a total of 1.5 h in three types of views (S 1 , S 2 , and S 3 , as shown in Figure 3). The positions of S 1 and S 3 represented the subject sitting at relative angles of 0 • (frontal view) and 30 • , respectively. These two recording views were used as training data (or defined as known-view data), whereas the S 2 recording view (sitting at a relative angle of 15 • ) was used as the testing data (or defined as unknown-view data). Note that the known-view data means the DL model recognizes the recording view of the data, and the unknown-view data means the DL model does not recognize the recording view of the data. Three different recording views were used to simulate the real conditions. used [31]. The IMU sensor was placed flat on the table before the experiment started to ensure that the calibrated x-, y-, and z-axis directions were the same in every recording.
Considering the inherent deficit of cumulative error from the accelerometers, they were calibrated before the experiments were held. The data had been analyzed to compensate for accelerator errors, resulting in the minimization of cumulative errors. Data were recorded for a total of 1.5 h in three types of views (S , S , and S , as shown in Figure 3). The positions of S and S represented the subject sitting at relative angles of 0° (frontal view) and 30°, respectively. These two recording views were used as training data (or defined as known-view data), whereas the S recording view (sitting at a relative angle of 15°) was used as the testing data (or defined as unknown-view data). Note that the known-view data means the DL model recognizes the recording view of the data, and the unknown-view data means the DL model does not recognize the recording view of the data. Three different recording views were used to simulate the real conditions.

Methods
Two experiments were conducted: the first was for the comparison of model performance, and the second was for the comparison between the performances of the 2D and 3D systems. The purpose of the first experiment was to address the question of whether the DBLSTM model could demonstrate better results in the human motion regression task. In the second experiment, the DBLSTM model was further used to compare the performance of 2D and 3D image features. We recorded several types of data acquired from different recording views, as shown in Figure 3, but the same type of motion to simulate the real conditions for daily life and rehabilitation use. Considering the conditions for the users, the recording views in every system operation may not be sufficiently similar. Consequently, the system should tolerate some displacement resulting from the recording view. Therefore, the second experiment was conducted to analyze whether 2D or 3D image features can be applied to the usage of different recording views, leading to a more robust system.
In the first experiment, the DNN and CNN models were compared with the DBLSTM model. The DNN structure is the most commonly used model in machine learning and includes several hidden layers consisting of multilayer perceptrons [27] that calculate the relationship between the input and output using back propagation technology [32]. A DNN was selected as our baseline owing to its simple structure. CNN is a well-known DL structure frequently used in image recognition and signal analysis [33], which obtains spatial or temporal information through convolutional calculations. In our previous study, a CNN was used to regress the information of IMU signals from 3D images, and it was proven to provide suitable performance [23]. Here, the known-view data were used to evaluate the performance of the system. The data were split into training and testing sets at a ratio of 9:1. The setting information of the three models is listed in Table 1. In  Figure 3. Setup of the recording environment: S 1 and S 3 stand for the known-view data, whereas S 2 stands for the unknown-view data.

Methods
Two experiments were conducted: the first was for the comparison of model performance, and the second was for the comparison between the performances of the 2D and 3D systems. The purpose of the first experiment was to address the question of whether the DBLSTM model could demonstrate better results in the human motion regression task. In the second experiment, the DBLSTM model was further used to compare the performance of 2D and 3D image features. We recorded several types of data acquired from different recording views, as shown in Figure 3, but the same type of motion to simulate the real conditions for daily life and rehabilitation use. Considering the conditions for the users, the recording views in every system operation may not be sufficiently similar. Consequently, the system should tolerate some displacement resulting from the recording view. Therefore, the second experiment was conducted to analyze whether 2D or 3D image features can be applied to the usage of different recording views, leading to a more robust system.
In the first experiment, the DNN and CNN models were compared with the DBLSTM model. The DNN structure is the most commonly used model in machine learning and includes several hidden layers consisting of multilayer perceptrons [27] that calculate the relationship between the input and output using back propagation technology [32]. A DNN was selected as our baseline owing to its simple structure. CNN is a well-known DL structure frequently used in image recognition and signal analysis [33], which obtains spatial or temporal information through convolutional calculations. In our previous study, a CNN was used to regress the information of IMU signals from 3D images, and it was proven to provide suitable performance [23]. Here, the known-view data were used to evaluate the performance of the system. The data were split into training and testing sets at a ratio of 9:1. The setting information of the three models is listed in Table 1. In addition, batch normalization [34] was used in each hidden layer, and the Nadam optimizer [35] was used to optimize the models. The initial learning rate was set to 0.0005, and the batch size was set to 1024. In the second experiment, the performance between the 2D and 3D features was determined. The DBLSTM model, which had the best performance (see details in Section 4.1) in the first experiment, was also used in this experiment. The recording views from S 1 and S 3 were used to train the DL model. Two parts were set up to evaluate the robustness of the performance under different conditions, defined as the known-view and unknownview performances. The known-view performance used all the recording views of the testing data that were recognized in the system, whereas the unknown-view performance used distinct recording views of the testing data that were not recognized in the system. The same training data were used as in the first experiment, which included recording views from S 1 and S 3 . In addition, the recording view from S 2 was used to evaluate the unknown-view performance.

Evaluation
The RMSE [36], percent root mean square error (PRMSE) [37], and coefficient of multiple correlations (CMC) [38] were used to determine the performance of the proposed system. Each axis was separated to evaluate the performance. The RMSE was used to indicate the differences between the labeled data and predicted results. The PRMSE, which is a method based on the RMSE, provided a more accurate index, which was determined using the mean of the data. The PRMSE has the advantage of demonstrating better precision because of the variances of the three axes. The PRMSE equation is expressed as: where G i andĜ i represent the testing target and predicted result in the ith time frame, respectively. In this type of method, it helps to evaluate the percentage of the error that is produced by the system. The CMC indicates whether the relationship between G i andĜ i is well connected and is expressed as: where G is the average of the test target. The evaluation of the CMC can be found in [39].
The CMC value was limited to 0−1, where CMC ≈ 1 indicates an excellent relationship between the testing target and predicted results, 0.5 < CMC < 0.99 indicates a good relationship, 0.16 < CMC < 0.49 indicates a moderate connection, and CMC ≤ 0.16 indicates a poor relationship. The predicted results provided three types of information: the x-, y-, and z-axis accelerations, which were evaluated separately. Here, lower RMSE and PRMSE values with a higher CMC value indicate that the system exhibited better performance.

Model Performance Comparison
The results of the model performance comparison, with the best performances marked in bold, are listed in Table 2. First, all the models demonstrated good performance with CMC > 0.9 and PRMSE < 5%, which means that using DL methods can obtain good performance in human motion tracking tasks. Moreover, the DBLSTM model achieved the best CMC performance with the least RMSE and PRMSE among these models. Specifically, the DBLSTM model predicted the most accurate results compared with the labeled data. CNN was ranked second, and DNN performed the worst. The performance might depend on the characteristics of each model. The DBLSTM model holds an advantage when processing temporal information. No studies have compared the performance for the three types of DL methods simultaneously in human motion tracking tasks. Several studies in the field of speech recognition and human activity recognition showed that DBLSTM exhibited a more stable and accurate performance compared with such kinds of traditional DL models [28,39]. The proposed DBLSTM model considered the IMU acceleration as the target of the regression model and proved that the method was feasible. Even though the concept of the DBLSTM is not that novel, it is a state-of-the-art model in terms of analyzing temporal information. DBLSTM comprises LSTM units, which are characteristic of its forget gate, input gate, and output gate. The forget gate includes a sigmoid function to manipulate the information originating from the previous unit, whereas the input and output gates include sigmoid and hyperbolic tangent functions to reinforce the important information originating from its unit mixed with the message coming from the forget gate. The most important information can be extracted and passed to the next LSTM unit. Based on the message that human motion is a continuous signal, adequate parameters manipulated by time sequence can make the model perform better, and the increased hidden layers help the model be more robust. Moreover, CNN can capture subsequent information through kernels and convolution. The same features were used to demonstrate a similar performance, but the performance shows the limits. DNN owns the fundamental concept in deep learning, and it does exhibit the basic performance. It is clear that acceleration information is highly related to the temporal changes. In conclusion, the DBLSTM model succeeded in extracting useful information, obtaining more accurate results than the CNN and DNN models.

Comparison of 2D/3D Performance
Based on the results of the first experiment, we selected DBLSTM to evaluate the performance of the 2D and 3D experiments. The results of the 2D and 3D performance experiments are shown in Figure 4. The horizontal axis represents the number of time frames, whereas the vertical axis shows the value of the acceleration data. The results of the x-, y-, and z-axis accelerations from known-view data are shown in Figure 4a-c, respectively. The results of the x-, y-, and z-axis accelerations from the unknown-view data are shown in Figure 4d-f, respectively. There are three lines in each subplot: blue represents the testing target from the IMU sensors, yellow represents the predicted results from 2D image features, and orange represents the predicted results from the 3D image features. The 3D image features performed slightly better than the 2D image features, as shown in Figure 4a-c. More detailed information about the known-view and unknown-view performances are listed in Tables 3 and 4, respectively. Using 3D features to regress human motion information was confirmed to be more accurate than using 2D features, as shown in Table 4. Both of the features in the known-view conditions achieved good performance with CMC > 0.9 and PRMSE < 5%. We think that the DBLSTM model was successful in extracting the features of human motion based on its memory cell structure. In addition, the amount of motion data was sufficient for the system to handle this task. Our results demonstrate that using 2D images could achieve a similar accuracy level as 3D images from the known recording views of the data. However, for the unknown recording views of the data, the 2D image had increased prediction errors, whereas the 3D image maintained high accuracy, as shown in Figure 4d-f.

Comparison of 2D/3D Performance
Based on the results of the first experiment, we selected DBLSTM to evaluate the performance of the 2D and 3D experiments. The results of the 2D and 3D performance experiments are shown in Figure 4. The horizontal axis represents the number of time frames, whereas the vertical axis shows the value of the acceleration data. The results of the x-, y-, and z-axis accelerations from known-view data are shown in Figure 4a-c, respectively. The results of the x-, y-, and z-axis accelerations from the unknown-view data are shown in Figure 4d-f, respectively. There are three lines in each subplot: blue represents the testing target from the IMU sensors, yellow represents the predicted results from 2D image features, and orange represents the predicted results from the 3D image features. The 3D image features performed slightly better than the 2D image features, as shown in Figure 4a-c. More detailed information about the known-view and unknownview performances are listed in Tables 3 and 4, respectively. Using 3D features to regress human motion information was confirmed to be more accurate than using 2D features, as shown in Table 4. Both of the features in the known-view conditions achieved good performance with CMC > 0.9 and PRMSE < 5%. We think that the DBLSTM model was successful in extracting the features of human motion based on its memory cell structure. In addition, the amount of motion data was sufficient for the system to handle this task. Our results demonstrate that using 2D images could achieve a similar accuracy level as 3D images from the known recording views of the data. However, for the unknown recording views of the data, the 2D image had increased prediction errors, whereas the 3D image maintained high accuracy, as shown in Figure 4d-f.   The detailed performance results are presented in Table 4. The 2D features clearly performed well with increased RMSE and PRMSE, and a low CMC. Conversely, the 3D features achieved good performance with CMC > 0.9 and PRMSE < 10%. The difference between the 2D and 3D image features was the depth information, which can be referred to as the z-axis information in J 3D . The depth information assisted the system to obtain spatial messages when subjects were performing motions, thus improving the accuracy. In addition, every kind of recording view was different for the 2D images because the features lacked depth information. Even though a subject performed the same type of motion, it appeared to be different for 2D image features because of the different recording views. The 2D system performed well under the known-view condition but was challenged under the unknown-view condition. This implies that 3D image features have an advantage over 2D image features in unknown-view applications. The CMC in the unknown-view performance was as good as in the known-view performance. However, the PRMSE in the unknown-view performance appeared slightly higher than that in the known-view performance. The linear relationship between the predicted results and the IMU data was good, and the PRMSE appeared to be slightly higher than expected. It is assumed that the errors could be decreased by increasing the recording views for the provided human motion data.

Conclusions
This study proposed a DL-based system to track human motion using 3D image features with a DBLSTM model. Three-dimensional image features were combined with the DL model to improve the accuracy of human motion tracking and increase the tolerance of the recording view. The "arm reaching" movement was used as a target task to evaluate the performance, which showed that such features could succeed in regressing motion information. A depth camera was used to capture the motion of the subject, providing adequate information for doctors or therapists to evaluate important features of the patients' performance. The main contributions of this study are as follows.
(1) Classic DL models were used to compare the performance of a human motion regression task. The results demonstrated that the DBLSTM model had an advantage over DNN and CNN models because of its ability to retain temporal information. Because the acceleration information from IMU sensors are continuous signals, models that can process time sequences are advantageous. As we can see, the DBLSTM was demonstrated to predict the most accurate results based on similar parameters.
(2) The model's predictability with 2D and 3D image features was compared, and the data from multiple recording views were used to confirm the advantages of the DBLSTM model. The results showed that the 3D system demonstrated superiority in recording the spatial information of human motion. This is because 3D image features contained depth information that was able to capture spatial changes. The application of 3D image features was first discovered to demonstrate the potential for unknown-view applications. The relative relationships between chosen human joints were the features used. The information includes the relative vectors and Euclidean distances. In our opinion, the features help build a robust connection between joints that carry out the same motion type from different recording views. The relative information from different recording views seems to be stable in 3D image features. As we can see, the left elbow moves relative to the left wrist with the same motions, which can be defined as a rigid relationship. The unknown recording view S 2 is likely to refer to the interpolation of the perspective of known views S 1 and S 3 , and it shows that applications for unknown views are likely to succeed. However, we cannot guarantee whether other recording views, especially referring to extrapolations in 45 • , 60 • , or higher, can achieve the same performance as in the interpolated case.
In summary, the proposed system has the potential for clinical application. Looking ahead to the application of tele-rehabilitation, there are increasingly more opportunities to use cameras to track and evaluate the condition of patients. The proposed system provided good information that indicated the acceleration of human motion. However, considering future applications in clinical use, more data from different recording views may be needed. The proposed system demonstrated potential for use in tele-rehabilitation, but its practicality must be verified through subsequent studies. We also recognize that the calibration process only reduces but does not eliminate the motion integration error based on the acceleration retrieved from the accelerometer. In practical applications, it is still necessary to analyze the error function that changes over time, so that the solution we propose can perform better. In future applications that use camera sensors with depth photography capabilities, patients will not need to wear any sensors to record the signals, and body movement could be tracked with a single camera, resulting in a convenient and practical way to improve the delivery of the clinical application.